CUNEF¶

Master's Degree in Data Science¶

Department: Machine Learning

Assignment 3: Yelp Analysis

Case Study:¶

Build machine learning models to identify possible business insights from a dataset extracted from Yelp's open data source, combining information about the businesses, the reviews users assign to them, and the users themselves.

Import the libraries that we are going to use¶

In [1]:
import json
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
from matplotlib import pyplot as plt

# import geopandas as gpd

from sklearn.preprocessing import OrdinalEncoder

warnings.filterwarnings('ignore')

1. LOADING THE DATASETS¶

1.1 Business Dataset¶

In [3]:
url_business="C:\\Users\\karla\\Documents\\CUNEF\\5_machine_learning\\2_practicas\\entregables\\Yelp_ML\\00_data\\00_raw\\yelp_academic_dataset_business.json"

df_business=pd.read_json(url_business,
                         lines=True,
                         orient='columns')
In [4]:
print('The business dataset contains', df_business.shape[0], 'rows and', df_business.shape[1], 'columns')
print('Here are the first 5 rows of the dataset')
df_business.head(5)
The business dataset contains 150346 rows and 14 columns
Here are the first 5 rows of the dataset
Out[4]:
business_id name address city state postal_code latitude longitude stars review_count is_open attributes categories hours
0 Pns2l4eNsfO8kk83dixA6A Abby Rappoport, LAC, CMQ 1616 Chapala St, Ste 2 Santa Barbara CA 93101 34.426679 -119.711197 5.0 7 0 {'ByAppointmentOnly': 'True'} Doctors, Traditional Chinese Medicine, Naturop... None
1 mpf3x-BjTdTEA3yCZrAYPw The UPS Store 87 Grasso Plaza Shopping Center Affton MO 63123 38.551126 -90.335695 3.0 15 1 {'BusinessAcceptsCreditCards': 'True'} Shipping Centers, Local Services, Notaries, Ma... {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ...
2 tUFrWirKiKi_TAnsVWINQQ Target 5255 E Broadway Blvd Tucson AZ 85711 32.223236 -110.880452 3.5 22 0 {'BikeParking': 'True', 'BusinessAcceptsCredit... Department Stores, Shopping, Fashion, Home & G... {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ...
3 MTSW4McQd7CbVtyjqoe9mw St Honore Pastries 935 Race St Philadelphia PA 19107 39.955505 -75.155564 4.0 80 1 {'RestaurantsDelivery': 'False', 'OutdoorSeati... Restaurants, Food, Bubble Tea, Coffee & Tea, B... {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ...
4 mWMc6_wTdE0EUBKIGXDVfA Perkiomen Valley Brewery 101 Walnut St Green Lane PA 18054 40.338183 -75.471659 4.5 13 1 {'BusinessAcceptsCreditCards': 'True', 'Wheelc... Brewpubs, Breweries, Food {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2...
In [5]:
df_business.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150346 entries, 0 to 150345
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   150346 non-null  object 
 1   name          150346 non-null  object 
 2   address       150346 non-null  object 
 3   city          150346 non-null  object 
 4   state         150346 non-null  object 
 5   postal_code   150346 non-null  object 
 6   latitude      150346 non-null  float64
 7   longitude     150346 non-null  float64
 8   stars         150346 non-null  float64
 9   review_count  150346 non-null  int64  
 10  is_open       150346 non-null  int64  
 11  attributes    136602 non-null  object 
 12  categories    150243 non-null  object 
 13  hours         127123 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 16.1+ MB

- Information in the dataframe: \ The business dataframe consists of 14 columns with information about each business on Yelp: the location of the establishment, the name of the business, the type of business, among other information highly relevant to this analysis.

1.2 Review Dataset¶

Because the dataset is very large, we read it with the `chunksize` parameter of `pd.read_json`, which yields the file in small batches of rows instead of loading it all at once. We used a chunk size of 1000 rows to handle the dataset.

In [6]:
# Create an empty list that will collect the chunks forming the dataset
chunks = []
size = 1000

# Specify dtypes 
r_dtypes = {"stars": np.float16, 
            "useful": np.int32, 
            "funny": np.int32,
            "cool": np.int32} 

reader = pd.read_json("C:\\Users\\karla\\Documents\\CUNEF\\5_machine_learning\\2_practicas\\entregables\\Yelp_ML\\00_data\\00_raw\\yelp_academic_dataset_review.json",
                      lines=True,
                      chunksize=size,
                      dtype= r_dtypes,
                      orient='records')

for chunk in reader:
        reduced_chunk = chunk.drop(columns=['review_id'])\
                             .query("`date` >= '2006-12-31'")
        chunks.append(reduced_chunk)
        

# Concatenate each chunk into a single dataframe        
df_reviews = pd.concat(chunks, ignore_index=True)
In [7]:
print('The size of the dataset is', df_reviews.shape[0], 'rows and', df_reviews.shape[1], 'columns')
print('Here are the first 5 reviews')
df_reviews.head(5)
The size of the dataset is 6985581 rows and 8 columns
Here are the first 5 reviews
Out[7]:
user_id business_id stars useful funny cool text date
0 mh_-eMZ6K5RLWhZyISBhwA XQfwVwDr-v0ZS3_CbbE5Xw 3.0 0 0 0 If you decide to eat here, just be aware it is... 2018-07-07 22:09:11
1 OyoGAe7OKpv6SyGZT5g77Q 7ATYjTIgM3jUlt4UM3IypQ 5.0 1 0 1 I've taken a lot of spin classes over the year... 2012-01-03 15:28:18
2 8g_iMtfSiwikVnbP2etR0A YjUWPpI6HXG530lwP-fb2A 3.0 0 0 0 Family diner. Had the buffet. Eclectic assortm... 2014-02-05 20:30:30
3 _7bHUi9Uuf5__HHc_Q8guQ kxX2SOes4o-D3ZQBkiMRfA 5.0 1 0 1 Wow! Yummy, different, delicious. Our favo... 2015-01-04 00:01:03
4 bcjbaE6dDog4jkNY91ncLQ e4Vwtrqf-wpJfwesgvdgxQ 4.0 1 0 1 Cute interior and owner (?) gave us tour of up... 2017-01-14 20:54:15
In [8]:
df_reviews.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6985581 entries, 0 to 6985580
Data columns (total 8 columns):
 #   Column       Dtype         
---  ------       -----         
 0   user_id      object        
 1   business_id  object        
 2   stars        float16       
 3   useful       int32         
 4   funny        int32         
 5   cool         int32         
 6   text         object        
 7   date         datetime64[ns]
dtypes: datetime64[ns](1), float16(1), int32(3), object(3)
memory usage: 306.5+ MB

- Information in the dataframe: \ The "review" dataframe contains the reviews and ratings that users assign to businesses: the stars given to each business, the user_id that placed the review, and the date the review was placed, among other information. This dataset can also give a lot of insights into the reviews made by each user.

1.3 Checkin Dataset¶

In [9]:
url_check= "C:\\Users\\karla\\Documents\\CUNEF\\5_machine_learning\\2_practicas\\entregables\\Yelp_ML\\00_data\\00_raw\\yelp_academic_dataset_checkin.json"

df_checkin=pd.read_json(url_check,
                        lines=True,
                        orient='columns')
In [10]:
print('The size of the dataset is', df_checkin.shape[0], 'rows and', df_checkin.shape[1], 'columns')
print('Here are the first 5 check-in records')
df_checkin.head(5)
The size of the dataset is 131930 rows and 2 columns
Here are the first 5 check-in records
Out[10]:
business_id date
0 ---kPU91CF4Lq2-WlRu9Lw 2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020...
1 --0iUa4sNDFiZFrAdIWhZQ 2010-09-13 21:43:09, 2011-05-04 23:08:15, 2011...
2 --30_8IhuyMHbSOcNWd6DQ 2013-06-14 23:29:17, 2014-08-13 23:20:22
3 --7PUidqRWpRSpXebiyxTg 2011-02-15 17:12:00, 2011-07-28 02:46:10, 2012...
4 --7jw19RH9JKXgFohspgQw 2014-04-21 20:42:11, 2014-04-28 21:04:46, 2014...
In [11]:
df_checkin.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131930 entries, 0 to 131929
Data columns (total 2 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   business_id  131930 non-null  object
 1   date         131930 non-null  object
dtypes: object(2)
memory usage: 2.0+ MB

- Information in the dataframe: \ This dataset only provides the business_id and, in the date column, the comma-separated timestamps of every check-in recorded for that business on Yelp.
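Since the date field stores all of a business's check-in timestamps as one comma-separated string, it can be split to derive per-business check-in counts; a minimal sketch on a toy frame mimicking the schema (the mini-frame and the `num_checkins` column are our own, not part of the notebook):

```python
import pandas as pd

# Hypothetical mini-frame with the same two columns as the checkin dataset
df_checkin_demo = pd.DataFrame({
    "business_id": ["a", "b"],
    "date": ["2020-03-13 21:10:56, 2020-06-02 22:18:06", "2013-06-14 23:29:17"],
})

# Number of check-ins = number of comma-separated timestamps
df_checkin_demo["num_checkins"] = df_checkin_demo["date"].str.split(",").str.len()

# The timestamps themselves can be expanded to one row per check-in
exploded = (df_checkin_demo.assign(date=df_checkin_demo["date"].str.split(", "))
            .explode("date"))
exploded["date"] = pd.to_datetime(exploded["date"])
```

The exploded form makes it possible to aggregate check-ins by hour, weekday, or year later on.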

1.4 User Dataset¶

Because the dataset is very large, we again read it with the `chunksize` parameter. Here we use chunks of 100 rows and keep every other chunk, roughly halving the amount of user data loaded.

In [12]:
# Create an empty list that will collect the chunks forming the dataset
chunks = []
size = 100
num=0

reader_user = pd.read_json("C:\\Users\\karla\\Documents\\CUNEF\\5_machine_learning\\2_practicas\\entregables\\Yelp_ML\\00_data\\00_raw\\yelp_academic_dataset_user.json",
                      lines=True,
                      chunksize=size,
                      orient='columns')

for chunk in reader_user:
    # Keep only every other chunk to halve the data
    if num % 2 == 0:
        chunks.append(chunk)
    num = num + 1
        
df_user = pd.concat(chunks, ignore_index=True)
In [13]:
print('The size of the dataset is', df_user.shape[0], 'rows and', df_user.shape[1], 'columns')
print('Here are the first 5 users')
df_user.head(5)
The size of the dataset is 993997 rows and 22 columns
Here are the first 5 users
Out[13]:
user_id name review_count yelping_since useful funny cool elite friends fans ... compliment_more compliment_profile compliment_cute compliment_list compliment_note compliment_plain compliment_cool compliment_funny compliment_writer compliment_photos
0 qVc8ODYU5SZjKXVBgXdI7w Walker 585 2007-01-25 16:47:26 7217 1259 5994 2007 NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA... 267 ... 65 55 56 18 232 844 467 467 239 180
1 j14WgRoU_-2ZE1aw1dXrJg Daniel 4333 2009-01-25 04:35:42 43091 13066 27281 2009,2010,2011,2012,2013,2014,2015,2016,2017,2... ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A... 3138 ... 264 184 157 251 1847 7054 3131 3131 1521 1946
2 2WnXYQFK0hXEoTxPtV2zvg Steph 665 2008-07-25 10:41:00 2086 1010 1003 2009,2010,2011,2012,2013 LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA... 52 ... 13 10 17 3 66 96 119 119 35 18
3 SZDeASXq7o05mMNLshsdIA Gwen 224 2005-11-29 04:38:33 512 330 299 2009,2010,2011 enx1vVPnfdNUdPho6PH_wg, 4wOcvMLtU6a9Lslggq74Vg... 28 ... 4 1 6 2 12 16 26 26 10 9
4 hA5lMy-EnncsH4JoR-hFGQ Karen 79 2007-01-05 19:40:59 29 15 7 PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA... 1 ... 1 0 0 0 1 1 0 0 0 0

5 rows × 22 columns

In [15]:
df_user.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 993997 entries, 0 to 993996
Data columns (total 22 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   user_id             993997 non-null  object 
 1   name                993997 non-null  object 
 2   review_count        993997 non-null  int64  
 3   yelping_since       993997 non-null  object 
 4   useful              993997 non-null  int64  
 5   funny               993997 non-null  int64  
 6   cool                993997 non-null  int64  
 7   elite               993997 non-null  object 
 8   friends             993997 non-null  object 
 9   fans                993997 non-null  int64  
 10  average_stars       993997 non-null  float64
 11  compliment_hot      993997 non-null  int64  
 12  compliment_more     993997 non-null  int64  
 13  compliment_profile  993997 non-null  int64  
 14  compliment_cute     993997 non-null  int64  
 15  compliment_list     993997 non-null  int64  
 16  compliment_note     993997 non-null  int64  
 17  compliment_plain    993997 non-null  int64  
 18  compliment_cool     993997 non-null  int64  
 19  compliment_funny    993997 non-null  int64  
 20  compliment_writer   993997 non-null  int64  
 21  compliment_photos   993997 non-null  int64  
dtypes: float64(1), int64(16), object(5)
memory usage: 166.8+ MB

- Information in the dataframe: \ This dataset focuses on the users who write reviews for the businesses on the Yelp platform: their user_id, how many reviews they have placed, how long they have been using Yelp, their friends, among other information.

This dataset can be useful to evaluate and segment the different types/classes of users on the Yelp platform.

1.5 Tip Dataset¶

In [16]:
url_tip="C:\\Users\\karla\\Documents\\CUNEF\\5_machine_learning\\2_practicas\\entregables\\Yelp_ML\\00_data\\00_raw\\yelp_academic_dataset_tip.json"

df_tip=pd.read_json(url_tip,
                    lines=True,
                    orient='columns')
In [17]:
print('The size of the dataset is', df_tip.shape[0], 'rows and', df_tip.shape[1], 'columns')
print('Here are the first 5 tips')
df_tip.head(5)
The size of the dataset is 908915 rows and 5 columns
Here are the first 5 tips
Out[17]:
user_id business_id text date compliment_count
0 AGNUgVwnZUey3gcPCJ76iw 3uLgwr0qeCNMjKenHJwPGQ Avengers time with the ladies. 2012-05-18 02:17:21 0
1 NBN4MgHP9D3cw--SnauTkA QoezRbYQncpRqyrLH6Iqjg They have lots of good deserts and tasty cuban... 2013-02-05 18:35:10 0
2 -copOvldyKh1qr-vzkDEvw MYoRNLb5chwjQe3c_k37Gg It's open even when you think it isn't 2013-08-18 00:56:08 0
3 FjMQVZjSqY8syIO-53KFKw hV-bABTK-glh5wj31ps_Jw Very decent fried chicken 2017-06-27 23:05:38 0
4 ld0AperBXk1h6UbqmM80zw _uN0OudeJ3Zl_tf6nxg5ww Appetizers.. platter special for lunch 2012-10-06 19:43:09 0
In [18]:
df_tip.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 908915 entries, 0 to 908914
Data columns (total 5 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   user_id           908915 non-null  object        
 1   business_id       908915 non-null  object        
 2   text              908915 non-null  object        
 3   date              908915 non-null  datetime64[ns]
 4   compliment_count  908915 non-null  int64         
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 34.7+ MB

- Information in the dataframe: \ This dataset contains the tips, which are short reviews left for the different businesses.

Summary of Dataset Review¶

  • Because of the large size of the Review and User datasets, we needed methods to reduce the data so it could be analyzed without crashing the program. There are also some variables that are not in the correct format; depending on the datasets used, we will make the appropriate transformations.

2. Analyze the datasets¶

2.1 Missing values¶

We are going to verify whether there are any missing or null values inside the datasets that we will continue working with.
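Since the same null-value summary is computed for each of the five datasets below, it could also be factored into a small reusable helper; a sketch (the name `missing_summary` is ours, not from the notebook):

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the columns that contain nulls, with counts and percentages."""
    out = (df.isnull().sum()
             .sort_values(ascending=False)
             .rename('null_values')
             .reset_index())
    out['total_rows'] = len(df)
    out['miss_percentage'] = round(out['null_values'] / out['total_rows'] * 100, 3)
    return out[out['miss_percentage'] > 0]

# Tiny demo frame: column 'a' has one null out of three rows
demo = pd.DataFrame({'a': [1, None, 3], 'b': [1, 2, 3]})
missing_summary(demo)
```

Each cell below would then reduce to a single call such as `missing_summary(df_business)`.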

  • Missing values at business dataset
In [19]:
# Verify the null values in the dataframe of business
null_col=df_business.isnull().sum().sort_values(ascending=False).rename('null_values').reset_index()

# Create table with all rows and percentage of missing values per variable
null_col['total_rows']=len(df_business)
null_col["miss_percentage"]=round((null_col['null_values']/null_col['total_rows']).mul(100),3)
null_col=null_col[null_col["miss_percentage"]>0]
null_col
Out[19]:
index null_values total_rows miss_percentage
0 hours 23223 150346 15.446
1 attributes 13744 150346 9.142
2 categories 103 150346 0.069
  • Missing values at review dataset
In [20]:
# Verify the null values in the dataframe of business
null_col=df_reviews.isnull().sum().sort_values(ascending=False).rename('null_values').reset_index()

# Create table with all rows and percentage of missing values per variable
null_col['total_rows']=len(df_reviews)
null_col["miss_percentage"]=round((null_col['null_values']/null_col['total_rows']).mul(100),3)
null_col=null_col[null_col["miss_percentage"]>0]
null_col
Out[20]:
index null_values total_rows miss_percentage
  • Missing values at checkin dataset
In [21]:
# Verify the null values in the dataframe of business
null_col=df_checkin.isnull().sum().sort_values(ascending=False).rename('null_values').reset_index()

# Create table with all rows and percentage of missing values per variable
null_col['total_rows']=len(df_checkin)
null_col["miss_percentage"]=round((null_col['null_values']/null_col['total_rows']).mul(100),3)
null_col=null_col[null_col["miss_percentage"]>0]
null_col
Out[21]:
index null_values total_rows miss_percentage
  • Missing values at user dataset
In [22]:
# Verify the null values in the dataframe of business
null_col=df_user.isnull().sum().sort_values(ascending=False).rename('null_values').reset_index()

# Create table with all rows and percentage of missing values per variable
null_col['total_rows']=len(df_user)
null_col["miss_percentage"]=round((null_col['null_values']/null_col['total_rows']).mul(100),3)
null_col=null_col[null_col["miss_percentage"]>0]
null_col
Out[22]:
index null_values total_rows miss_percentage
  • Missing values at tip dataset
In [23]:
# Verify the null values in the dataframe of business
null_col=df_tip.isnull().sum().sort_values(ascending=False).rename('null_values').reset_index()

# Create table with all rows and percentage of missing values per variable
null_col['total_rows']=len(df_tip)
null_col["miss_percentage"]=round((null_col['null_values']/null_col['total_rows']).mul(100),3)
null_col=null_col[null_col["miss_percentage"]>0]
null_col
Out[23]:
index null_values total_rows miss_percentage

Summary\ The only dataset that we are going to keep using and that has missing values is the business dataset, with information missing in the variables hours (the hours during which the business is open), attributes and categories.

  • Imputing missing data

We are going to handle the missing values as follows:\ 1. categories <- we drop the rows with missing values, as they represent only 0.069% of the data. \ 2. hours & attributes <- we impute the missing values with the string 'None' to mark that the information was absent in those cells; for the hours column more than 15% of the values are missing. Since we do not know why the information is missing, we remove the NaN while noting that the original information was not recorded.

In [24]:
# fill the missing values with 'None' in hours and attribute rows
df_business['hours']=df_business['hours'].fillna('None')
df_business['attributes']=df_business['attributes'].fillna('None')

# Drop the rows with missing values in categories
df_business.dropna(subset=['categories'], inplace=True)
  • Confirming that the missing values were handled
In [25]:
# Verify the null values in the dataframe of business
null_col=df_business.isnull().sum().sort_values(ascending=False).rename('null_values').reset_index()

# Create table with all rows and percentage of missing values per variable
null_col['total_rows']=len(df_business)
null_col["miss_percentage"]=round((null_col['null_values']/null_col['total_rows']).mul(100),3)
null_col=null_col[null_col["miss_percentage"]>0]
null_col
Out[25]:
index null_values total_rows miss_percentage
In [26]:
df_business.shape
Out[26]:
(150243, 14)

2.2 Explore some of the variables of the datasets¶

- Dataset of Business¶

a. Explore the different business categories and the most common categories in the Yelp database

In [27]:
# Create the new dataframe to verify the most frequent categories
business_categ=(', '.join(df_business['categories'].dropna())) # join all category strings into one
business_categ=pd.DataFrame([c.strip() for c in business_categ.split(',')],columns=['category']) # split into rows, trimming whitespace, and convert to DataFrame
business_categ=business_categ.value_counts().to_frame().reset_index() # count the number of each category
business_categ.columns=['category','num_business'] # rename the columns

# Plot the created dataset in a graph to visualize better the information
plt.figure(figsize=(15,10))
ax=sns.barplot(data=business_categ.iloc[0:15], y= 'category', x='num_business', palette="crest")
ax.set_ylabel('Category')
ax.set_xlabel('Number of business')
ax.set_title('Top 15 Business Categories in Yelp Database')

# Add the value labels to each bar in the plot
for p in ax.patches:
    ax.annotate(int(p.get_width()),
                ((p.get_x() + p.get_width()),
                 p.get_y()),
                 xytext=(1, -18),
                fontsize=12,
                color='black',
                textcoords='offset points',
                horizontalalignment='left')   
    
plt.show() 
  • Insights

The top business category on the Yelp platform is Restaurants, with 52268 different restaurants in the database. After Restaurants, Food is the second category with the most businesses. Knowing the different categories of business and which are the most frequent in the Yelp dataset can help identify which categories to focus on first in the analysis.

For example, if we want to identify the different kinds of customers that assign star ratings to businesses, we can start with the Restaurants category, which is the leading business category on the Yelp platform.

b. Visualization of the Number of reviews by state

In [28]:
# Plot the graph to see the states with the most reviews written to different business
fig=px.choropleth(df_business[['state','review_count']].groupby('state').sum().reset_index(),
                  locations='state',
                  locationmode="USA-states",
                  scope='usa',
                  color='review_count',
                  color_continuous_scale="Viridis",
                  labels={'review_count':'Number reviews per State'}
                 )

fig.show()
  • Insights

With this geographic map we can see where the majority of reviews are written. If we want to focus on a specific state to evaluate the reviews, a good choice could be PA, the state with the largest number of reviews in the entire United States, or more broadly the states on the East Coast of the US, which concentrate most of the reviews.

We can also see that, among the state values in the original dataset, some are not a United States state abbreviation. We are going to identify which those non-US values are.

In [30]:
# Describe the location of the file
url_state_label="C://Users//karla//Documents//CUNEF//5_machine_learning//2_practicas//entregables//Yelp_ML//00_data//00_raw//states_hash.json"

# Create the dataframe from the json file
state_label=pd.read_json(url_state_label, typ='series').to_frame().reset_index()
state_label.columns=['state','state_desc']
state_abbr=state_label['state'].to_list() # create a list with all the states abbreviation

# Create a list with all the unique values of state in the dataset
uni_states=df_business['state'].unique()

# Create new list that will save the values that are a US state abbreviation and the states that are not an abbreviation
no_state=[]
state_contain=[]

# Create a loop that goes through our unique values of state abbreviation and identify the values that are not an state
for elements in uni_states:
    if elements in state_abbr:
        state_contain.append(elements)
    else:
        no_state.append(elements)
In [31]:
# The states that are not part of the US are
no_state
Out[31]:
['AB', 'XMS']

Of the 27 state abbreviations in the dataset, 2 are not US states: AB corresponds to a Canadian province and XMS to a location in England. For further analysis we can decide whether or not to focus only on the US states.

c. Visualization of the Average star range by state

In [32]:
fig=px.choropleth(df_business[['state','stars']].groupby('state').mean().reset_index(),
                  locations='state',
                  locationmode="USA-states",
                  scope='usa',
                  color='stars',
                  color_continuous_scale="Viridis",
                  labels={'stars':'Average Star range per State'}
                 )

fig.show()
  • Insights

Here we visualize how the average star rating is distributed across the different states in the US.

- Dataset of Review¶

a. Evolution of the number of reviews by month

In [33]:
review_series = df_reviews['date'].value_counts()
In [34]:
review_series.resample('M').sum().plot()
Out[34]:
<AxesSubplot: >
  • Insights

This chart shows the number of reviews placed on the Yelp platform each month. With this information, we could make future predictions about the number of reviews that will be placed on Yelp for specific categories.

The chart also shows a decrease in the number of reviews that could be a result of the COVID pandemic in 2020.

b. Distribution of the star ratings placed by users

In [35]:
plt.figure(figsize=(10,5))
ax = sns.countplot( x='stars' , data = df_reviews, palette='crest')
ax.set_ylabel('Count')
ax.set_xlabel('Stars')
ax.set_title('Average Star Rating by Reviews in the Yelp Database')

for p in ax.patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy() 
        ax.text(x+width-.4, 
                y+height,
                '{:.0f}'.format(height),
                horizontalalignment='center') 
                
plt.show()
  • Insights

The distribution of star ratings does not follow a normal distribution: as we can see, a large share of reviews carries a rating of 5.

- Dataset of Users¶

a. Visualize the distribution of the average star given by users

In [36]:
plt.figure(figsize=(7,5))
ax = sns.histplot( x='average_stars' , data = df_user, palette='crest', binwidth=0.25)
ax.set_ylabel('Count')
ax.set_xlabel('Stars')
ax.set_title('Average star rating assigned by users')

for p in ax.patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy() 
        ax.text(x+width-.4, 
                y+height,
                '{:.0f}'.format(height),
                horizontalalignment='center') 
                
plt.show()
  • Insights

In this plot we can see the distribution of the average star rating assigned by the users of the Yelp platform. The majority of users give a high average rating to businesses, and only a small proportion of users give a low average rating.

b. Correlation between selected variables

In [39]:
plt.figure(figsize=(10,8))

df_user_corr=df_user.copy()
df_user_corr=df_user_corr.drop(columns=['user_id','name','friends','yelping_since','elite'])

sns.heatmap(df_user_corr.corr())
Out[39]:
<AxesSubplot: >
  • Insights

With the correlation heatmap of the user dataframe we can start getting a clue about which variables are correlated with each other. This can help us later to better understand the different classes of users in the Yelp dataset. Some variables are more correlated than others, but this gives an initial perspective of how the user variables relate to one another. It will be most helpful in the modeling stage, when clustering the users of the Yelp dataframe.
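To complement a heatmap, the strongest correlated pairs can also be listed numerically; a sketch of the approach on synthetic data (the column names here are illustrative, not the notebook's output):

```python
import numpy as np
import pandas as pd

# Synthetic frame: 'funny' is built to correlate strongly with 'useful'
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({'useful': x,
                     'funny': x * 0.9 + rng.normal(scale=0.1, size=200),
                     'fans': rng.normal(size=200)})

corr = demo.corr()
# Keep the upper triangle only, so each pair appears once and the diagonal is dropped
pairs = (corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
             .stack()
             .sort_values(key=abs, ascending=False))
print(pairs.head(3))
```

Applied to `df_user_corr.corr()`, the same recipe would rank the variable pairs behind the heatmap.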

2.3 Conclusions of the Exploratory Analysis of the Datasets¶

From the datasets provided by Yelp we can create different models depending on the business insight or the question that we want answered from the data. Some examples of possible business insights that can be obtained from the data are:

  1. A machine learning model to classify text reviews based on the probability of obtaining a 5 or a 1 star rating.
  2. An unsupervised machine learning model to group the different users based on their average star rating.
  3. A machine learning model that, depending on a business, its location, its characteristics and other variables, classifies whether a business is likely to close or not.
  4. A recommendation model that suggests a specific category to a specific user.

In the modeling process we are going to focus on an ML algorithm that classifies review text to predict the star rating, and on an unsupervised ML model to group users.
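As an illustration of the text-classification direction, a minimal TF-IDF plus logistic-regression sketch on toy reviews; the texts and labels below are invented for the example, not the notebook's model or data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews with their star labels (1 = negative, 5 = positive)
texts = ["terrible food, never again", "awful service and cold food",
         "amazing place, loved it", "great staff, wonderful experience"]
stars = [1, 1, 5, 5]

# Pipeline: turn text into TF-IDF features, then fit a linear classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, stars)
pred = model.predict(["wonderful food, amazing service"])
```

On the real review dataset, `texts` would be `df_reviews['text']` and `stars` would be `df_reviews['stars']`, with a proper train/test split.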

3. Preparation of the new datasets that will be used¶

As we are going to work with the user dataset, we will make some changes to the original dataset in order to extract the correct information from it.

In [40]:
# Create a copy of the original user dataset and drop user_id and name, as this information is not necessary
user_info=df_user.copy()
user_info=user_info.drop(['user_id','name'], axis=1)
In [41]:
# Create a function that counts the comma-separated values inside each entry of a column
def count_val(column_name):
    num_elements=[]
    for elements in user_info[column_name]:
        if len(elements)==0:
            num_elements.append(0)
        else:
            num_elements.append(len(elements.split(',')))
    return num_elements
In [42]:
# Add the new count columns to the existing user_info dataset
user_info['num_friends']=count_val('friends')
user_info['times_elite']=count_val('elite')

# Eliminate the original columns of friends and elite from the dataset
user_info=user_info.drop(['friends','elite'],axis=1)
In [43]:
# Keep only the year in which each user joined the platform
user_info['yelping_since']=pd.to_datetime(user_info['yelping_since'],format='%Y-%m-%d %H:%M:%S') # convert from object to datetime
user_info['yelping_since']=user_info['yelping_since'].dt.year
user_info
Out[43]:
review_count yelping_since useful funny cool fans average_stars compliment_hot compliment_more compliment_profile compliment_cute compliment_list compliment_note compliment_plain compliment_cool compliment_funny compliment_writer compliment_photos num_friends times_elite
0 585 2007 7217 1259 5994 267 3.91 250 65 55 56 18 232 844 467 467 239 180 14995 1
1 4333 2009 43091 13066 27281 3138 3.74 1145 264 184 157 251 1847 7054 3131 3131 1521 1946 4646 14
2 665 2008 2086 1010 1003 52 3.32 89 13 10 17 3 66 96 119 119 35 18 381 5
3 224 2005 512 330 299 28 4.27 24 4 1 6 2 12 16 26 26 10 9 131 3
4 79 2007 29 15 7 1 3.54 1 1 0 0 0 1 1 0 0 0 0 27 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
993992 23 2015 7 0 0 0 4.92 0 0 0 0 0 0 0 0 0 0 0 1 0
993993 1 2016 0 0 0 0 5.00 0 0 0 0 0 0 0 0 0 0 0 1 0
993994 4 2017 1 1 0 0 2.00 0 0 0 0 0 0 0 0 0 0 0 1 0
993995 2 2011 0 0 0 0 3.00 0 0 0 0 0 0 0 0 0 0 0 1 0
993996 2 2020 0 0 0 0 5.00 0 0 0 0 0 0 0 0 0 0 0 1 0

993997 rows × 20 columns

In [44]:
user_info.shape
Out[44]:
(993997, 20)
  • After these changes we can work with the new user_info dataframe to build the cluster analysis of the different users in the Yelp database
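A hedged sketch of what that clustering step could look like: scale the numeric features, then fit KMeans. The data below is synthetic stand-in for `user_info`'s numeric columns, and the choice of two clusters is arbitrary for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two well-separated groups of "users" with 3 features each
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=0, scale=1, size=(50, 3)),
               rng.normal(loc=8, scale=1, size=(50, 3))])

# Scaling first matters because KMeans is distance-based
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
```

On the real data, `X` would be `user_info`'s columns, and the number of clusters would be chosen with a criterion such as the elbow method or silhouette score.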

4. Export the datasets to parquet¶

In [45]:
# Dataset of reviews
# df_reviews['stars']=df_reviews['stars'].astype(float)
# df_reviews.to_parquet("C:/Users/karla/Documents/CUNEF/5_machine_learning/2_practicas/entregables/Yelp_ML/00_data/01_process/reviews_data.parquet")

# Dataset of User_info
user_info['average_stars']=user_info['average_stars'].astype('float') # We do this to be able to save the dataset into a parquet
user_info.to_parquet("C:/Users/karla/Documents/CUNEF/5_machine_learning/2_practicas/entregables/Yelp_ML/00_data/01_process/user_info_data.parquet")